Length of games in Major League Baseball by Jason Bowles

Introduction Of this Analysis

The motivation for this analysis came while I was listening to a baseball game on the radio earlier this year.

The announcers were complaining about the number of balls being called by the umpire that night. One of the announcers made the bold statement, “If you want to shorten the length of baseball games.. CALL MORE STRIKES”. I thought this was interesting and wondered if there is data that I can explore that can tell me exactly what is driving the length of games in major league baseball. I found game log information and decided to see for myself.

This is a custom dataset pulled from http://www.retrosheet.org/.

From their “game logs” section http://www.retrosheet.org/gamelogs/index.html, I pulled only games from 1950 to 2014 so as to keep my observations down to under 150K. A summary of the variables found in this dataset can be found here: http://www.retrosheet.org/gamelogs/glfields.txt. For this analysis we’ll narrow it down to the following fields:

date,starting_hof,era,game_number,day_of_week,visiting_team, visiting_league,visiting_game_number, home_team, home_league, home_game_number,innings_played,visit_score,home_score,game_outs,day_night_ind,attendance,time_of_game,visit_line_score,home_line_score,visit_AB, visit_hits,visit_2B,visit_3B,visit_HR,visit_RBI,visit_SACH,visit_SAC,visit_HBP,visit_BB,visit_IBB, visit_SO,visit_SB, visit_CS,visit_GDP,visit_CatherInterference,visit_LOB,visit_pitchers,visit_IndER, visit_TeamER, visit_WildPitch,visit_Balk,visit_Putout,visit_assists,visit_errors,visit_passed_ball,visit_DP, visit_TriplePlay,home_AB, home_hits,home_2B,home_3B,home_HR,home_RBI,home_SACH,home_SAC,home_HBP,home_BB,home_IBB, home_SO,home_SB, home_CS,home_GDP,home_CatherInterference,home_LOB,home_pitchers,home_IndER, home_TeamER, home_WildPitch,home_Balk,home_Putout,home_assists,home_errors,home_passed_ball,home_DP, home_TriplePlay

Several variables are combined together to get “game” information.

(runs, SO, BB, AB, hits, GDP, DP, pitchers, WH (BB + Hits))

The “era” (post war, westward expansion, deadball 2, free agency/arbitration, steroids, current)

The era was added to each game as well, based on the following post on the Huffington Post: http://www.huffingtonpost.com/quora/what-are-the-major-eras-o_b_3547814.html

Hall of Fame pitchers

Lastly, all pitchers who have been inducted into the Baseball Hall of Fame were pulled from Wikipedia (including the three most recent additions: Randy Johnson, Pedro Martinez, and John Smoltz). Using this information I added the number of HOF pitchers that started the game (0, 1, or 2). My initial feeling is that better quality pitching will have an effect on the “time of game”; we’ll see if the data proves me right!

Univariate Plots Section (Basically we’re trying to understand the individual variables)

First looking at the distribution of the time_of_game variable

The histogram shows that the length of a game ranges from just under 100 minutes to as high as 260(ish) minutes. (Remember these are only 9 inning games; extra inning games can go over 360 minutes!!!)

Now let’s summarize the variable “time_of_game”

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    74.0   144.0   160.0   160.8   176.0   287.0

The summary shows that our initial look at the histogram was pretty darn close

Now let’s look to see how many observations include a hall of fame pitcher starting a game.

##      0      1      2 
## 102055  11712    439

With over 11,000 games started by at least 1 Hall of Fame pitcher, I think we have enough to see if they make an impact on the game length.

We’ll also check to make sure that we have adequate coverage of all the eras.

##                 current              deadball 2 free agency/arbitration 
##                   24319                   10155                   39752 
##                post war                steroids      westward expansion 
##                    3324                   26562                   10094

It would be nice to have more “post war” era games.. but I think we’ll be fine with just over 3000 games in this era. Besides, by Retrosheet’s own admission the information captured for older games is less reliable, so those games have a higher chance that some variables will be missing or null.

Since we think Walks + Hits (WH) will have an impact, let’s take a look at that variable, along with double header and total_runs

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   19.00   23.00   23.84   28.00   58.00

Looking at the histogram and the summary we can see that Walks + Hits is fairly widely distributed, with a range of 2 - 58. I did look up the game that had only 2 walks + hits, and it was the September 9th, 1965 game between the Chicago Cubs and the Los Angeles Dodgers. Sandy Koufax beat Bob Hendley and the Chicago Cubs one to nothing. Koufax threw a perfect game (no walks or hits) while Hendley threw a 1 hitter and gave up 1 walk. The interesting thing is that the lone run was scored without the benefit of a hit. Lou Johnson was walked in the 5th, then a sacrifice bunt put him on 2nd. Johnson then stole 3rd and scored on an errant throw by the catcher. Wow!

Now let’s compare on top of each other the histogram of Walks + Hits and the time of game. Maybe we’ll see something interesting here.

The distributions seem to be fairly consistent, which we’ll investigate further in the bivariate and multivariate sections of the review.

Since we have information on Double Headers, I’m curious to see how many we have to review

##      N      Y 
## 102834  11372

Better than I had hoped, we have over 11,000 games that were part of a double header!

For our last single variable plot, let’s see if the total_runs scored has the same distribution we saw with Walks+Hits. The theory being that Walk+Hits drives more run scoring

The distribution of total runs has a longer tail than Walks + Hits and some definite spikes, which may indicate that Walks+Hits and total runs are not as highly correlated as I would have thought.

This last part is to see what kind of distribution we have. The histograms suggest that they may be normally distributed, but let’s see if it is a statistically valid assumption

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  game_logs$time_of_game
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  game_logs$WH
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

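For both outputs the p-value is low, meaning that each variable is statistically different from the normal distribution it was tested against. In both cases there may be some outliers that cause this test to fail. Note also that D = 1 is the largest value the statistic can take, which suggests the comparison may have been made against a standard normal (mean 0, standard deviation 1) rather than one matched to the data, so these results should be read with caution.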
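To illustrate how a statistic of exactly 1 can arise, here is a minimal sketch on simulated game times (not the Retrosheet data). It assumes the test was called as `ks.test(x, "pnorm")`, which compares against a standard normal unless a mean and standard deviation are supplied; whether that matches the original call is an assumption. The variable names are hypothetical.

```r
# Simulated stand-in for time_of_game in minutes; NOT the real game logs.
set.seed(42)
sim_time <- rnorm(500, mean = 160, sd = 25)

# Default reference distribution is N(0, 1). Every simulated value sits far
# in its right tail, so the two CDFs never overlap.
ks_default <- ks.test(sim_time, "pnorm")

# Reference normal using the sample's own mean and standard deviation.
ks_fitted <- ks.test(sim_time, "pnorm",
                     mean = mean(sim_time), sd = sd(sim_time))

print(unname(ks_default$statistic))  # the maximum possible distance
print(unname(ks_fitted$statistic))   # a much smaller distance
```

Because the mean and standard deviation in the second call are estimated from the same sample, its p-value is only approximate. Whole-minute game times also contain many ties, which `ks.test` will warn about on the real data.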

Univariate Analysis

Structure of the MLB game logs data

Each observation in this dataset is one complete game played in major league baseball from 1950 to 2014. Each game has information on the date it was played, the teams that participated and the major statistics for each team (home & visitor). Example statistics include Runs Scored, Hits, Walks [BB], Strikeouts [SO], etc.

Other information includes day vs. night game, number of hall of fame pitchers that started the game, the day of the week and the era in which the game was played (e.g. deadball2, steroids, etc.)

The main feature is that of the “time_of_game”

This feature describes in minutes how long the baseball game took to complete. Recent rule changes went into effect in 2015 that were designed to shorten the overall length of a baseball game. However, since the 2015 season is still in progress at the time of this writing, we’ll have to wait until the conclusion of this season to see whether those changes result in a downward trend for the time of game. Until then, it will be interesting to see what variables in our dataset show an impact on the length of the game. Are there other aspects of the game that we can focus on that will help shorten its overall length?

A high level discussion on other features that may have some interesting effects on our main feature.

At first look, it appears from logic and from the data that the number of hits and walks should have a positive relationship with the length of a major league baseball game (those 2 things usually lead to more runs scored and thus a longer game).

Another thing of interest is whether or not a Hall of Fame pitcher starting a game will have a negative relationship with the length of a game. Also of interest is whether the league rules and quality of play have a bearing on the length of game. The NL and AL have slightly different rules and different levels of parity in each season of baseball.

From the data we will also be able to conclude if day vs. night games are different or if a game that is part of a double header would be shorter than a regularly scheduled game.

Some new variables created from the dataset.

I created a factor to indicate if a particular observation was part of a double header. I also combined the following variables to give a “total”, as this may be more interesting than the individual team totals: hits, runs, BB, SO, pitchers used, Double Plays and At Bats. I also created a separate variable for the year of the game; the year makes it easier to describe the season to which an observation belongs.
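The derived variables described above could be built along these lines. This is a sketch on three made-up rows, not the original code: the `total_*` column names are hypothetical, and it assumes `date` is stored as a `yyyymmdd` string and that `game_number` is 0 for a regular game and 1 or 2 for the games of a double header.

```r
# Toy rows for illustration only; NOT taken from the real game logs.
game_logs <- data.frame(
  date        = c("19650909", "20140412", "20140412"),
  game_number = c(0, 1, 2),
  visit_hits  = c(0, 9, 7),
  home_hits   = c(1, 11, 6),
  visit_BB    = c(0, 3, 2),
  home_BB     = c(1, 4, 5),
  stringsAsFactors = FALSE
)

# Combine the visiting and home columns into game-level totals.
game_logs$total_hits <- game_logs$visit_hits + game_logs$home_hits
game_logs$total_BB   <- game_logs$visit_BB + game_logs$home_BB

# Walks + Hits for the whole game.
game_logs$WH <- game_logs$total_BB + game_logs$total_hits

# Double header indicator: "Y" when the game is game 1 or 2 of a twin bill.
game_logs$double_header <- factor(ifelse(game_logs$game_number == 0, "N", "Y"))

# Season year taken from the first four characters of the date.
game_logs$year <- as.integer(substr(game_logs$date, 1, 4))

print(game_logs[, c("year", "double_header", "total_hits", "total_BB", "WH")])
```

The same pattern extends to runs, strikeouts, pitchers used, double plays and at bats by adding the matching `visit_` and `home_` columns.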

Major rule changes and team additions happen before a season begins, so summarizing by year may give some indication of how those changes have affected our main variable.

At first glance there weren’t any “weird” observations. But some of the variables had to be “transformed” and I did trim the dataset

I did change the variables Starting_Hof and game_number to be factors, so that I could correctly see a summary of these in the R console.

Since our main variable is time of game, it is important to compare apples to apples.. meaning you can’t compare an extra inning game to a regular 9 inning game or a rain shortened game.

The subset of data removes any game that was either longer or shorter than 9 innings
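A minimal sketch of these two clean-up steps on made-up rows (the values are illustrative and the name `nine_inning` is hypothetical):

```r
# Toy rows for illustration only; NOT taken from the real game logs.
game_logs <- data.frame(
  innings_played = c(9, 11, 9, 5, 9),
  starting_hof   = c(0, 1, 0, 2, 1),
  game_number    = c(0, 0, 1, 2, 0),
  time_of_game   = c(160, 215, 148, 95, 171)
)

# Treat the counts as categories so summaries show a frequency table
# rather than a five-number summary.
game_logs$starting_hof <- factor(game_logs$starting_hof)
game_logs$game_number  <- factor(game_logs$game_number)

# Keep only regulation 9 inning games (drops extra inning and shortened games).
nine_inning <- subset(game_logs, innings_played == 9)

print(table(nine_inning$starting_hof))
```

Factor levels with no remaining games are still listed with a count of zero after subsetting, which is worth remembering when reading the tables.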

Bivariate Plots Section

What ways can we further investigate the time_of_game variable? We’ll start with a histogram adding era as the fill.

You can see in this plot how each era moves slightly to the right as it nears the current timeframe. I like this plot, but let’s facet by era instead and see how that looks.

This one puts steroids right under the deadball 2 era, showing how different those time periods were.

Over time the length of games has changed in what ways?

Here I can see that the game times have steadily increased over time, but adding the smooth shows that it isn’t completely linear. There is still a lot of noise here, so we’ll have to look at this in a different way later

Next up.. scatter plot for runs vs. game time

So here we see that total runs scored means a longer game.. but that isn’t a revolutionary finding, I’m thinking there is more to it than this.

Now we’ll take a look at Strikeouts

So here we don’t see much of a relationship. I would have thought more strikeouts = shorter games and expected a slightly negative relationship if anything. But I guess a game is more efficient if the pitcher pitches to contact, because a strikeout requires at least 3 pitches, but a groundball out can be done with 1 pitch.

Let’s see how walks have an effect on the time of a game.

This plot looks similar to the total_runs scatter plot. A relationship is there, but again not too hard to make this conclusion.

What about hits on the length of a game?

In the previous charts we saw that low numbers of strikeouts, walks and runs still had a lot of variation. In this graph the variation is cleaned up a bit. This may lead to something.

Last we’ll examine walks + Hits

Walks + Hits is even tighter than the others in its relationship to time_of_game. We’ll see if we can build on this later.

Now that we’ve exhausted all of the Retrosheet information on game time, let’s see if the variable I added for starting HOF pitchers has an impact.

So having a Hall of Famer start the game means you are in for a shorter game for sure. And if you have the opportunity to watch two Hall of Famers go at it.. expect a short visit to the park that day! Very few outliers too!

I added the era too, how does game time look across era?

I like the boxplot here because each box on this chart is like a histogram on its side and it is easier to see which era is shifted the furthest left (post war and deadball 2). The steroids era seems to be the highest, but the current timeframe is very close.

I would guess that double headers are shorter in length than a regular game, but let’s see what the data says. First we’ll look at just whether a game was part of a double header.. then at the game number

Just what I thought, double headers are shorter.. but the crazy thing is that game 1 of the double header is consistently shorter than game 2! More on this later.

Is there a difference when it comes to the National (NL) or American (AL) leagues?

Both are close.. but AL games run slightly longer than NL games.

Does day of the week matter (Perhaps Sunday Night baseball games are longer)?

Nothing really stands out here! Uncanny really how even it is across days. Sunday appears to be slightly shorter (but not by much)

To help visualize this better, we’ll summarize the data by year so that we can see the median game time over the years.

When looking at it this way, we can see the median game time start really jumping in the early 70s. If our dataset didn’t go back to the 60s, I would think the relationship was completely linear.
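Summarizing by year can be sketched with base R’s `aggregate`. The rows below are made up for illustration, and `median_game_time` is a hypothetical column name; the original summary may have been built differently.

```r
# Toy rows for illustration only; NOT taken from the real game logs.
game_logs <- data.frame(
  year         = c(1950, 1950, 1950, 1975, 1975, 2014, 2014, 2014),
  time_of_game = c(135, 140, 150, 150, 158, 175, 181, 190)
)

# One row per season with the median game time in minutes.
median_by_year <- aggregate(time_of_game ~ year, data = game_logs, FUN = median)
names(median_by_year)[2] <- "median_game_time"

print(median_by_year)
```

A line plot of the result, for example `plot(median_by_year$year, median_by_year$median_game_time, type = "l")`, gives the year-over-year view described here with far less noise than plotting every game.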

For fun let’s also see what attendance has done over the years

Attendance has increased almost every year, with some noticeable dips in the early 1980s and mid 90s

I think this is as good a time as any to think about how a scatterplot matrix may help give some interesting insight!

We’ll do two scatterplot matrices so that it is more readable.. we’ll keep time_of_game and year in each matrix though

I was a bit surprised that Strikeouts had any kind of positive correlation with time_of_game. We’re also seeing that runs have an impact on time of game here. I think the standard boxplots for time_of_game and starting_hof are not the best here. We can investigate them separately.

Now let’s run the 2nd set

Here is our first glimpse that Walks+Hits has a high correlation with time_of_game, but hasn’t changed much over the years. Double Headers are definitely shorter than regular games here as well. Total Pitchers used has the highest correlation with time of game. Are walks+hits driving this though?

Let’s compare walks, strikeouts and WH across eras using a boxplot

From the bar graph we can see that the deadball era had a really low amount of walks. So maybe it wasn’t really a deadball era, but rather a pitching dominance era

We see that DeadBall 2 is near the top for strikeouts but not as high as the current era. The post-war era is lagging behind the others with the lowest median strikeouts per game

Strange that walks + hits doesn’t really have a dramatic difference through the years and eras.

Bivariate Analysis

Most of the features that I expected to have an impact on “time_of_game” showed the relationship, but some did not.

First thing I investigated was to see if the time of game has truly increased over time. This is a common observation from those who write regularly about the game of baseball. The line graph did show this trend, however it isn’t completely linear. There are dips in game time throughout time.

My thought is that if an umpire is calling more strikes, then the batter would be forced to swing at more pitches that are not in their “hot” zones, resulting in fewer overall hits. By the same reasoning, more strikes being called means fewer overall walks and more overall strikes. Thus I would expect that the higher the number of hits and walks, the longer the game will take. I also expect that more strikes being called would mean more strikeouts, which would have a negative relationship with the game time.

My initial conclusion about walks and hits seems to be holding true in the generated scatterplots (even more so when they were combined), however it appears that strikeouts have no relationship with the time of game.

Another thing of note is that the “era” of baseball is a good reflection of the time of game.. but as noted before era is closely related to year and therefore it wasn’t a huge surprise that the eras have a similar impact on game time.

Looking across categorical features, a couple of interesting things stood out: the number of HOF starting pitchers, double header games and the home league each showed a different mean for time of game.

Other things that were interesting

I know that a lot has been written about attendance at major league baseball games. I created a line graph to examine how attendance has ebbed and flowed throughout the years. There were expected drops (right after the 1994 baseball players’ strike and after the steroids hearings in Congress). Looks like there was also a dip in the early eighties. I did a quick search and found there was also a strike in 1981 which resulted in 713 games being cancelled. Although, unlike 1994, the season did resume and a World Series was still played.

I was also interested in some general stats across the different eras. For instance the number of walks is fairly consistent across eras, except the post-war era. Strikeouts did show some variation with lower averages for post-war, deadball2 and westward expansion. I was really surprised to see strikeouts be lower for deadball2, because that was the time period of Bob Gibson’s 1.12 ERA (Earned Run Average) and Denny McLain winning 31 games (both in 1968), which resulted in Major League Baseball lowering the pitcher’s mound and effectively ending the deadball2 era.

My thought was that strikeouts would be up, not down. So looking at walks + hits may show more of the story here. In fact when I look at this graph the deadball2 era has the lowest walks + hits of any era, which suggests the deadball2 era was known less for strikeouts and more for pitching to contact and letting the defense do the work.

What is driving the first game of a double header to be shorter?

I completely understand the results of the effect of a HOF pitcher starting a game (better pitcher = fewer hits and fewer walks, and therefore a shorter game). However I wasn’t sure the same could hold true for a double header. But the data clearly shows that games that are a part of a double header are shorter than regular games. I can understand that the 2nd game may be shorter because the position players are all tired, thus giving an advantage to the well rested game 2 pitcher, but I’m not sure why game 1 is also shorter (in fact slightly shorter than game 2). Maybe the umpires are expanding their strike zones in anticipation of a long day?? We’ll look to see if we can uncover more information about why double headers are shorter.

Multivariate Plots Section

What can we see if we’re looking at time_of_game with color added to show the era

From this graph we can see that most of the eras are consistent in their trend (upward or downward) on length of game.

What happens when we look at Walks + Hits (WH) over the years.. do they stay the same?

Walks + Hits doesn’t seem to vary much, but does appear to be slightly higher in the steroids era. I added color for the time of game and we can see how that is getting darker as the years progress.

When looking at Walks + Hits (WH) can we see a change in era?

From this graph I saw that relationship for walks and hits, but adding in era gave us an indication of how the length of games has changed with the eras. Towards the top are the steroids and current eras, with the bottom being free agency, westward expansion, deadball 2 and post war.

To compare NL vs. AL, we’ll create a table and then display that over year so that we can see differences for time of game and total runs scored by league over the years

Over time again we see that game length has increased, but adding the dimension of home league, shows us that the American league games were consistently longer from the early 70s to about 2003.

Could this have been caused by more runs being scored over that time.. let’s see.

Yes, the AL scores more runs.. but they continue to do that in recent years, so the run scoring can’t be driving the change in length of games between the AL and NL.

Looks like the AL is consistently higher for game_time, let’s create a ratio to see how that has changed over the years

##   year  AL    NL
## 1 1950 140 136.0
## 2 1951 139 138.0
## 3 1952 141 139.5
## 4 1953 141 143.0
## 5 1954 143 147.0
## 6 1955 148 146.0

This graph does a good job of clarifying the time period of the game length differences between the AL and NL (If I had more time, I’d figure out how to make the 1 pop out as the guiding line)
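As a sketch of the ratio and of one way to make the 1 stand out, the block below uses only the six rows printed above. It assumes the ratio is AL divided by NL (the direction is not stated), and it uses base graphics with `abline(h = 1)` as the guide line, which may differ from the plotting library used for the original graph. The names `median_by_league` and `plot_ratio` are hypothetical.

```r
# The six rows shown in the table above (median game time in minutes).
median_by_league <- data.frame(
  year = 1950:1955,
  AL   = c(140, 139, 141, 141, 143, 148),
  NL   = c(136, 138, 139.5, 143, 147, 146)
)

# Assumed direction: values above 1 mean AL games ran longer that year.
median_by_league$ratio <- median_by_league$AL / median_by_league$NL

# Hypothetical helper: line plot of the ratio with a highlighted guide at 1.
plot_ratio <- function(df) {
  plot(df$year, df$ratio, type = "l",
       xlab = "Year", ylab = "AL / NL median game time")
  abline(h = 1, col = "red", lwd = 2, lty = 2)
}

# Seasons in which the NL median was the longer one.
print(median_by_league$year[median_by_league$ratio < 1])
```

Calling `plot_ratio(median_by_league)` draws the chart; the dashed red line marks the point where the two leagues have equal median game times.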

Look at a summary by year of day vs. night games

Absolutely nothing of interest here. Pretty consistent with short periods where Night games were shorter than day games and vice versa.

Back to Time_of_game to further explore relationships

Has Walks + Hits (WH) changed over time? Could we use that to show why game times have increased over time?

Here we see that the relationship of Walks+Hits is consistent across eras, and the free agency/arbitration era shows that most of the longer games were in the AL. The other eras were fairly consistent with game time across leagues.

Going back to double headers.. can we find something in the data to describe the relationship of the 1st game to the game time?

Even with sampling (to prevent overplotting), we are not seeing a difference for walks+hits across double headers and regular games

Multivariate Analysis

Is it Walks + Hits that is really driving game time?

As I continued to analyze the Walks + Hits (WH) impact on the time of game, I wasn’t able to show that WH has gone up over the years. It is clear that WH has a positive relationship with game time.. but that hasn’t really changed dramatically over the years. It was at its lowest in the deadball2 era, but that era didn’t have the shortest median game time.

Looking for more answers I considered the differences across leagues. I found that the American League games were consistently taking longer starting in the early 70s. This isn’t a surprise, as this is about the time that the American League instituted the Designated Hitter rule, essentially replacing the pitcher, typically a sub-.200 hitter, with a hitter who usually batted in the low .300s with power. This seems to have had an impact on runs scored and time of game. This effect has lessened in the current era, perhaps because of more parity and better pitching/defense in recent years (e.g. Defensive Shifts)

Another consideration was if there was any difference between the game time of day and night games. My initial assumption was that day games would be shorter because of fewer television commercial breaks, however there doesn’t appear to be a strong difference in any years or eras.

I continued to look at time of game across eras and NL vs. AL. Still nothing stood out, as even WH seemed to have the same relationship with time of game across eras.

Feeling a bit deflated, I decided to dig into the double header issue more. Why would the 1st game of a double header be shorter than the 2nd game and a regular season game?

## Source: local data frame [3 x 6]
## 
##   game_number median_so median_wh median_gt mean_pitchers      n
## 1           0        11        23       161      5.976671 102834
## 2           1        10        23       149      4.845499   5754
## 3           2        10        23       150      5.176041   5618

Interesting! Walks+Hits is exactly the same across game number, but there is still a difference in median game time. What really jumps out from the summary is the mean number of pitchers used.
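A summary table of this shape can be reproduced in base R as sketched below. The rows are made up for illustration, `total_SO` is a hypothetical column name, and the original table may well have been produced with a different package.

```r
# Toy rows for illustration only; NOT taken from the real game logs.
toy <- data.frame(
  game_number    = c(0, 0, 0, 1, 1, 2, 2),
  total_SO       = c(11, 12, 10, 10, 10, 9, 11),
  WH             = c(23, 25, 21, 22, 24, 23, 23),
  time_of_game   = c(160, 170, 150, 145, 151, 150, 154),
  total_pitchers = c(6, 7, 5, 4, 5, 5, 6)
)

# Collapse one group of games into a single summary row.
summarise_group <- function(g) {
  data.frame(
    game_number   = g$game_number[1],
    median_so     = median(g$total_SO),
    median_wh     = median(g$WH),
    median_gt     = median(g$time_of_game),
    mean_pitchers = mean(g$total_pitchers),
    n             = nrow(g)
  )
}

# Split by game number (0 = regular game, 1 and 2 = double header games),
# summarise each piece, and stack the results.
by_game <- do.call(rbind, lapply(split(toy, toy$game_number), summarise_group))

print(by_game)
```

Putting the median of Walks + Hits next to the mean number of pitchers in one table is what makes the contrast visible: one column can be flat across groups while the other is not.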

Let’s take this a bit further now.

Here we can see that over the years the number of pitchers used has steadily increased, and game 1 of the double header is always lower!

We’ll look at this a couple different ways to make sure.

This seems to confirm that the distribution of double header games is shifted to the left of regular season games

Now let’s generate the scatter plot

Yes, a high correlation between total pitchers used and time of game.

To get here I created a table to look at some of the different values for double headers. I was surprised to find that the median WH stat was identical across the 3 game types! How could this be when WH is highly correlated with game time? I would have thought that this would be lower in game 1 of the double header compared to game 2 and the regular season games.

Only after I added a summary of total_pitchers used across all of these games did something jump out at me. The average number of pitchers used in game 1 of a double header was 4.85, compared to 5.18 for game 2 and 5.98 for a regular season game!

From here I ran a series of graphs to confirm what I think I found! First a scatterplot to show the positive relationship between total_pitchers and time_of_game.

Now let’s switch to a line graph without the game numbers

When I look at the average number of pitchers used in a game across years, I see something that looks very similar to a graph that I created earlier. Comparing the total_pitchers to median game time through the years created remarkably similar graphs.

##   (2,5]   (5,8]  (8,13] (13,18]    NA's 
##   50187   46995   13575      81    3368

In this graph we can see that Walks+Hits hasn’t changed much over the years (like our graph with game time as the color), but instead of game_time as the color, I used the pitching factor that I created. And the output looks really good and explains a lot. We’ll definitely use this as one of the final plots!
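The NA's in the binned counts may come from how `cut` treats the lowest break. The sketch below assumes the pitching factor was built with breaks at 2, 5, 8, 13 and 18 (as the labels suggest) and default arguments; under that assumption a game in which each team used a single pitcher (2 in total) lands outside `(2,5]` and becomes NA. The values are made up for illustration.

```r
# Toy totals of pitchers used per game; NOT taken from the real game logs.
total_pitchers <- c(2, 3, 5, 6, 8, 9, 13, 14, 18)
breaks <- c(2, 5, 8, 13, 18)

# Default intervals are open on the left, so the value 2 is not in (2,5].
default_bins <- cut(total_pitchers, breaks = breaks)

# include.lowest = TRUE closes the first interval, giving [2,5].
inclusive_bins <- cut(total_pitchers, breaks = breaks, include.lowest = TRUE)

print(table(default_bins, useNA = "ifany"))
print(table(inclusive_bins, useNA = "ifany"))
```

If the NA's in the real data turn out to be two-pitcher games, `include.lowest = TRUE` would keep them in the first bin instead of dropping them from the coloring.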

Were there any interesting or surprising interactions between features?

Of surprise was the way pitching staffs have been used over time. In the early years of the dataset you can see that a higher number of pitchers used is related to the number of Walks + Hits given up. However, starting around 1974, more pitching changes were being made regardless of the Walks + Hits given up so far. Further research indicated that Major League Baseball rewrote its definition of the save statistic around then and modern bullpen usage started to take shape.

For example, starting pitchers in the post-war and deadball2 eras consistently threw more innings than today’s starters (Example: Bob Gibson 300+ innings in 1968). Today’s starters routinely pitch 6 innings and are replaced by situational specialists from the bullpen in the 7th and 8th innings, followed by the best pitcher for the 9th inning. (Consider Aroldis Chapman, record holder for the fastest pitch ever thrown, who pitches the 9th inning for the Reds)

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I did not create a model for my dataset, as new rule changes in effect this year in baseball would make a model based on past years useless. However, we can use this dataset to compare to the final 2015 game data to detect if the rule changes had the impact that Major League Baseball intended.


Final Plots and Summary

Plot One

Description One

This plot shows how I came to realize that it wasn’t a well rested pitcher facing a tired offense that drove shorter game times. What I realized is that in a double header you don’t have the opportunity to rest pitchers between games, and therefore a manager would be less inclined to make pitching changes in game 1 of the double header just in case game 2 warranted more pitching help.

Also in game 2, you can assume that all pitchers from game 1 would likely be off limits for use in that game. Therefore by default the use of pitchers would be less than a non-double header game

The Y axis focuses on the number of pitchers used across the years (x axis). The color indicates whether the game was part of a double header or not. Size of the point represents the median game time.

Plot Two

Description Two

The first plot showed the difference between double headers and regular games. This one focuses on the use of pitchers throughout the years (x axis) and its effect on the length of the game (y axis). The graph shows there is a high correlation between the number of pitchers used and the length of a baseball game. You can also see by the coloring of the scatterplot that games from an earlier time period were more likely to end having used only 2, 3 or 4 pitchers. At 5 pitchers, more current games start to dominate in this visualization.

To prevent overplotting in this case we used only 10,000 of the over 114,000 available observations.
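Drawing a random subset for plotting can be sketched as follows. The data frame here is simulated (only its row count, 114,206, comes from the tables earlier in this report), and the seed and the name `plot_sample` are arbitrary choices for illustration.

```r
# Simulated stand-in with the same number of rows as the trimmed dataset
# (102834 + 11372 = 114206); the values themselves are NOT real.
set.seed(2015)
n_games <- 114206
game_logs <- data.frame(
  total_pitchers = sample(2:12, n_games, replace = TRUE),
  time_of_game   = round(rnorm(n_games, mean = 160, sd = 25))
)

# Draw 10,000 distinct rows at random to keep the scatterplot readable.
plot_sample <- game_logs[sample(nrow(game_logs), 10000), ]

print(dim(plot_sample))
```

Fixing the seed makes the same 10,000 games come back on every run, so the plot does not change each time the report is regenerated.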

It was only after examining why the 1st game of the double header was shorter than any other game on average that the total pitchers used relationship revealed itself.

Plot Three

This plot will focus on how pitching changes have been done through the years when considering the number of walks + hits in a game

Description Three

The initial hypothesis for this investigation was that Walks + Hits would be the ultimate driver of the length of a baseball game. The graph shows that total Walks + Hits hasn’t changed much over the years (if at all), but what has changed is the number of pitchers used. Again you can see from this graph that pitching changes were highly correlated with Walks + Hits in the early years, but the trend starting around 1974 (with the redefinition of a “save”) was to make more pitching changes regardless of the current pitcher’s stats at the time of the change. Managers in recent years are more prone to try and take advantage of matchups after their starter has gone 5 or more innings.

Reflection

First of all I must admit that I enjoy baseball above all other sports and that helped drive my curiosity more than anything. But what I found most interesting is how much my pre-conceived notions about the data drove my initial research. If I hadn’t been driven to figure out why game 1 of a double header was shorter than any other game, I don’t think I would have found the relationship of pitching changes to the time of game.

The Walks + Hits variable was actually very misleading: there is a relationship for sure, but that relationship only seems to hold up within a given year. It was very interesting to find that even though game length has increased year after year.. Walks + Hits did not. So finding that pitching changes increased over the years was quite exciting.

Some further analysis could compare teams over the last 20 years to see if the teams with the highest game times are also the teams that make the most pitching changes. It would also be interesting to compare the average number of innings pitched by a starting pitcher over time. My guess is that the number has slowly declined from just over 8 innings to just over 6 innings.

The last analysis I would do with this dataset would be to compare the length of games managed by Tony La Russa to those of all other managers in major league baseball. Being a Cardinals fan, I had grown accustomed to what was known as the “Parade of Relievers” when La Russa managed. I would be curious to find out if his managing tactics were so bad when compared to others in the major leagues in the same time period.

Thanks for Reading. Jason Bowles